Improving Japanese language models using POS information
Abstract
This paper uses part-of-speech (POS) information to improve the performance of a Japanese language model (LM). A POS bigram is used to address the sparseness of the training data. In addition, owing to the characteristics of the Japanese language, part of the Japanese syntactic information can be integrated into the POS bigram through POS combination rules. Based on Japanese grammar, these rules determine whether a POS pair is prohibited in Japanese. The Japanese POS bigram table therefore includes not only the POS pairs that occur in the training corpus but also all prohibited POS pairs. Explicitly modeling the prohibited POS pairs reduces confusion in the search space. A series of experiments was carried out to investigate the impact of the POS bigram with prohibited POS pairs on the recognition search space. A framework for fast generation of language model look-ahead (LMLA) probabilities based on POS bigram information is also presented. The experimental results show that, compared with the traditional word n-gram model, the LM with POS bigram information achieves a significant improvement in both the word accuracy and the speed of a Japanese large-vocabulary continuous speech recognition (LVCSR) system.
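The idea of a POS bigram table that also records prohibited POS pairs can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the exact formulation, so the class-based factorization P(word | POS) x P(POS | previous POS), the add-alpha smoothing, the tag names, the romanized toy corpus, and the single "prohibited" pair are all assumptions chosen for illustration. In particular, the prohibited pair below is a toy rule, not a claim about Japanese grammar.

```python
import math
from collections import Counter


def train_pos_bigram(tagged_sentences, prohibited_pairs, alpha=1.0):
    """Estimate P(pos | prev_pos) with add-alpha smoothing (an assumption).

    Prohibited pairs are stored explicitly with probability 0.0, and the
    remaining mass is normalized over the allowed successors only.
    """
    tagset = sorted({pos for sent in tagged_sentences for _, pos in sent})
    pair_counts = Counter()
    for sent in tagged_sentences:
        tags = [pos for _, pos in sent]
        pair_counts.update(zip(tags, tags[1:]))

    table = {}
    for prev in tagset:
        allowed = [t for t in tagset if (prev, t) not in prohibited_pairs]
        total = sum(pair_counts[(prev, t)] + alpha for t in allowed)
        for t in tagset:
            if (prev, t) in prohibited_pairs:
                table[(prev, t)] = 0.0  # explicitly modeled as impossible
            else:
                table[(prev, t)] = (pair_counts[(prev, t)] + alpha) / total
    return table


def train_emission(tagged_sentences):
    """Estimate P(word | pos) by relative frequency."""
    pos_counts = Counter()
    word_pos_counts = Counter()
    for sent in tagged_sentences:
        for word, pos in sent:
            pos_counts[pos] += 1
            word_pos_counts[(word, pos)] += 1
    return {wp: c / pos_counts[wp[1]] for wp, c in word_pos_counts.items()}


def log_prob(prev_pos, word, pos, bigram, emission):
    """Log of P(word | pos) * P(pos | prev_pos); -inf if the pair is prohibited or unseen."""
    p = bigram.get((prev_pos, pos), 0.0) * emission.get((word, pos), 0.0)
    return math.log(p) if p > 0 else float("-inf")


# Hypothetical toy corpus of (word, POS) pairs; tags and words are illustrative.
corpus = [
    [("neko", "NOUN"), ("ga", "PARTICLE"), ("naku", "VERB")],
    [("inu", "NOUN"), ("ga", "PARTICLE"), ("hashiru", "VERB")],
]
# Toy combination rule for the sketch only.
prohibited = {("PARTICLE", "PARTICLE")}

bigram = train_pos_bigram(corpus, prohibited)
emission = train_emission(corpus)

allowed_score = log_prob("PARTICLE", "naku", "VERB", bigram, emission)
blocked_score = log_prob("PARTICLE", "ga", "PARTICLE", bigram, emission)
print(allowed_score, blocked_score)
```

In a decoder, a hypothesis whose score is negative infinity can be pruned immediately, which is one way to read the abstract's claim that modeling prohibited pairs reduces confusion in the search space.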
Similar resources
An improved joint model: POS tagging and dependency parsing
Dependency parsing is a form of syntactic parsing that automatically analyzes the dependency structure of sentences, producing a dependency graph for each input sentence. Part-of-speech (POS) tagging is a prerequisite for dependency parsing. Generally, dependency parsers perform POS tagging and dependency parsing in a pipeline mode. Unfortunately, in pipel...
Using Collocations and K-means Clustering to Improve the N-pos Model for Japanese IME
Kana-Kanji conversion is known as one of the representative applications of Natural Language Processing (NLP) for the Japanese language. The N-pos model, presenting the probability of a Kanji candidate sequence by the product of bi-gram Part-of-Speech (POS) probabilities and POS-to-word emission probabilities, has been successfully applied in a number of well-known Japanese Input Method Editor ...
A Part-of-Speech Tagging System for the Persian Language
Part-of-speech (POS) tagging is an essential step for many models and methods in other areas of natural language processing, such as machine translation, spell checking, text-to-speech, and automatic speech recognition. So far, highly accurate POS taggers have been created for many languages. In this paper, we focus on POS tagging in the Persian language. Because of problems in Persian POS t...
N-gram Language Modeling of Japanese Using Prosodic Boundaries
A new method was developed to incorporate prosodic boundary information into statistical language modeling. The method counts word transitions separately for those that cross accent phrase boundaries and those that do not. Since direct calculation of these two types of word transitions requires a large speech corpus, which is practically impossible to build, bi-gram counts of par...
Improving Chunking by Means of Lexical-Contextual Information in Statistical Language Models
In this work, we present a stochastic approach to shallow parsing. Most of the current approaches to shallow parsing have a common characteristic: they take the sequence of lexical tags proposed by a POS tagger as input for the chunking process. Our system produces tagging and chunking in a single process using an Integrated Language Model (ILM) formalized as Markov Models. This model integrate...